10 (a) Let us use the Boston dataset
In [1]:
library(MASS)
head(Boston)
# ?Boston for information
How many rows and columns are in this dataset?
In [2]:
dim(Boston)
The Boston dataset has 506 rows and 14 columns (fields). Each row represents a suburb of Boston. One column, medv (median value of owner-occupied homes), is the house-price response variable; the other 13 columns are properties of the suburb that help determine it.
10 (b) Let us create some pairwise scatter plots
In [22]:
pairs(Boston)
Since there are 14 variables, the scatter plot matrix is nearly illegible. Instead, we will get a bird's-eye view of our data using a correlation matrix.
In [12]:
corr_matrix = cor(Boston, method="pearson") # Generate Correlation Matrix
corr_matrix
The following observations were made:
10 (c) From the correlation matrix and the scatter plot, we can make the following observations about the crime rate crim.
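Scanning a 14 by 14 matrix for one variable is error-prone, so it can help to pull out just the crim row and sort it by strength. The following is a minimal sketch, not part of the original notebook; the names `crim_cor` and `crim_cor_sorted` are introduced here for illustration.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of every variable with crim (one row of the full matrix)
crim_cor <- cor(Boston)["crim", ]

# Drop the trivial self-correlation, then sort by absolute strength
crim_cor <- crim_cor[names(crim_cor) != "crim"]
crim_cor_sorted <- crim_cor[order(abs(crim_cor), decreasing = TRUE)]

round(crim_cor_sorted, 2)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.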
10 (d) The names of the suburbs are not given in the dataset, so suburbs with high crime rates, tax rates, or pupil-teacher ratios can only be identified relative to the rest, for example with a histogram. Let us determine the distribution of crime rates for the suburbs in our dataset.
In [24]:
hist(Boston$crim, breaks=20, xlab="Crime Rate", main="Histogram of Crime Rates")
It is clear that most of the suburbs in the sample have low crime rates. Now let's take a look at tax rates.
In [25]:
hist(Boston$tax, breaks=20, xlab="Tax Rate", main="Histogram of Tax Rates")
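Because a histogram groups values into bins, exact tax rates shared by many suburbs are easier to see in a frequency table. A minimal sketch, with `tax_counts` as an illustrative name that is not in the original notebook:

```r
library(MASS)  # provides the Boston data frame

# Count how many suburbs share each distinct tax rate, most common first
tax_counts <- sort(table(Boston$tax), decreasing = TRUE)

# Show the five most frequent tax rates and their counts
head(tax_counts, 5)
```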
Many suburbs have tax rates up to about 440, and another large group has tax rates around 680. Few suburbs fall between these figures. Let us now take a look at pupil-teacher ratios.
In [26]:
hist(Boston$ptratio, breaks=20, xlab="Pupil Teacher Ratio", main="Histogram of Pupil Teacher Ratios")
The histogram for pupil-teacher ratios looks fairly spread out, but with a pronounced spike at ratios between 20 and 20.5. Let us find out exactly how many such suburbs exist.
In [27]:
length(Boston$ptratio[20 < Boston$ptratio & Boston$ptratio < 20.5])
So 145 of the 506 suburbs in our dataset have a high pupil-teacher ratio between 20 and 20.5. Pretty interesting!
10 (e) Let us determine the number of suburbs that bound the Charles River.
In [28]:
length(Boston$chas[Boston$chas == 1])
So 35 suburbs bound the Charles River.
10 (f) What is the median pupil-teacher ratio for the towns in this dataset?
In [29]:
median(Boston$ptratio)
10 (g) Let us now find the suburb of Boston that has the lowest median value of owner-occupied homes.
In [30]:
index = which.min(Boston$medv) # Get the row index of the minimum medv
Boston[index, ] # Access this row
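Note that which.min returns only the first position of the minimum, so it is worth checking whether several suburbs tie for the lowest medv. A minimal sketch of that check, with `min_medv` and `tied_rows` as illustrative names not used elsewhere in the notebook:

```r
library(MASS)  # provides the Boston data frame

# Lowest median home value in the dataset
min_medv <- min(Boston$medv)

# All row indices that attain this minimum (which.min reports only the first)
tied_rows <- which(Boston$medv == min_medv)
tied_rows

# Inspect every tied suburb, not just the first one
Boston[tied_rows, ]
```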
So the 399th suburb in the dataset has the lowest median value for owner-occupied homes (medv = 5). Let us see how its other fields compare with the rest of the dataset, starting with the crime rate.
In [31]:
percentile = ecdf(Boston$crim) # ecdf takes a vector and returns a function that computes the percentile (as a fraction) of a value
print(paste("Crime Rate = ", percentile(Boston[index, 'crim']))) # Percentile of this suburb's crime rate
Let us quickly iterate over all fields to get the big picture.
In [32]:
fields = names(Boston)
for (field in fields){
  percentile = ecdf(Boston[[field]]) # Empirical CDF of this column
  print(paste(field, " = ", percentile(Boston[index, field]))) # Percentile of the selected suburb's value in this column
}
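The same percentiles can also be collected into one named vector without printing inside a loop. This is a sketch of an alternative, not the notebook's own method; `row_percentiles` is a name introduced here.

```r
library(MASS)  # provides the Boston data frame

index <- which.min(Boston$medv)  # row with the lowest medv

# For each column, build its empirical CDF and evaluate it at that row's value
row_percentiles <- sapply(names(Boston), function(field) {
  ecdf(Boston[[field]])(Boston[index, field])
})

round(row_percentiles, 3)
```

Each entry is the fraction of suburbs whose value in that column is less than or equal to the selected suburb's value, so entries near 0 or 1 mark the extreme fields.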
Wow! These values are extreme relative to the dataset. From this result, we can conclude that this suburb with the lowest median value of owner-occupied homes also has:
10 (h) Let us find the number of suburbs that average over 7 rooms per dwelling.
In [34]:
length(Boston$rm[Boston$rm > 7])
So 64 suburbs average more than 7 rooms per dwelling. Now let us see how many suburbs exceed 8.
In [35]:
length(Boston$rm[Boston$rm > 8])
13 suburbs in our dataset average more than 8 rooms per dwelling. Let us see what kind of suburbs these are.
In [36]:
Boston[Boston$rm > 8,]
In [37]:
summary(Boston[Boston$rm > 8, ])
Let us compare these statistics with those of the entire dataset.
In [38]:
summary(Boston)
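Comparing two summary() printouts by eye is tedious, so placing the medians side by side can make the differences easier to spot. A minimal sketch under that assumption; `big` and `comparison` are illustrative names, and the median is only one of several statistics that could be compared.

```r
library(MASS)  # provides the Boston data frame

# Suburbs averaging more than 8 rooms per dwelling
big <- Boston[Boston$rm > 8, ]

# Per-column medians for that subset next to the medians of the full dataset
comparison <- data.frame(
  median_rm_over_8 = sapply(big, median),
  median_all       = sapply(Boston, median)
)

round(comparison, 2)
```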
Some noticeable characteristics of the suburbs averaging over 8 rooms per dwelling: